In this test task you will have the opportunity to demonstrate your Data Science skills from various angles: processing, analyzing and visualizing data, finding insights, applying predictive techniques, and explaining your reasoning.
The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].
Please go through the steps below, build up the necessary code and comment on your choices.
Tasks:
import os, sys
os.getcwd()
# subfolders
print(os.listdir("data"))
print(os.listdir("output"))
'/mnt/N0326018/project'
['day.csv', '.ipynb_checkpoints', 'Readme-Data.txt', 'hour.csv']
['analyze_dataset.html', '.ipynb_checkpoints', 'analyze_dataset_comparison.html', 'sample_submission.csv']
print(sys.prefix)
print(sys.executable)
/mnt/N0326018/project/.venv
/mnt/N0326018/project/.venv/bin/python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
import seaborn as sns
import sweetviz as sv
from scipy.stats import scoreatpercentile
from statsmodels.graphics.gofplots import qqplot
import time
import math
from sklearn import preprocessing, metrics, linear_model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, StratifiedKFold, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
import pickle
import warnings
warnings.filterwarnings('ignore')
# Config display options
pd.options.display.max_colwidth = 10000
pd.options.display.float_format = '{:.2f}'.format
# Display all outputs in Jupyter Notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# I want pandas to show all columns and up to 1000 rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 1000)
# Environment for images
# This sets reasonable defaults for font size for
# a figure that will go in a notebook
sns.set_context("notebook")
# Set the font to be serif, rather than sans
sns.set(font='serif')
# Make the background white, and specify the
# specific font family
sns.set_style("whitegrid")
# read raw training data
df_all = pd.read_csv('data/day.csv')
df_hour_all = pd.read_csv('data/hour.csv')
# split dataset
df_last30 = df_all.tail(30) # held out as unseen test data
df = df_all.iloc[:-30, :]   # used for training
df.head()
df.shape
|   | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.34 | 0.36 | 0.81 | 0.16 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.36 | 0.35 | 0.70 | 0.25 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.20 | 0.19 | 0.44 | 0.25 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.20 | 0.21 | 0.59 | 0.16 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.23 | 0.23 | 0.44 | 0.19 | 82 | 1518 | 1600 |
(701, 16)
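The chronological split can be sanity-checked. A minimal sketch with a stand-in frame (the real notebook reads `data/day.csv`; the `cnt` values here are placeholders):

```python
# Sanity-check the chronological split: the last 30 rows are held out as
# unseen test data, the remaining rows are used for training.
import pandas as pd

df_all = pd.DataFrame({"cnt": range(731)})  # stand-in for data/day.csv (731 rows)
df_last30 = df_all.tail(30)                 # hold-out: final 30 days
df = df_all.iloc[:-30, :]                   # training portion

assert len(df) == 701 and len(df_last30) == 30
assert len(df) + len(df_last30) == len(df_all)
```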
Tasks:
- Compute the maximum number of bicycles n_max that was needed in any one day. (answer here)
- Compute the number of bicycles n_95 that was needed on 95% of days. (answer here)
- (n_max bicycles would cover 100% of days, n_95 covers 95%, etc.)

Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
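Since the weather fields are normalized by the stated maxima, approximate raw units can be recovered by multiplying back. A sketch with illustrative input values (not taken from the data):

```python
# De-normalize the weather columns using the maxima from the data dictionary.
# The sample values below are illustrative only.
temp_norm, atemp_norm, hum_norm, wind_norm = 0.34, 0.36, 0.81, 0.16

temp_c = round(temp_norm * 41, 2)        # temperature in degrees Celsius
atemp_c = round(atemp_norm * 50, 2)      # feeling temperature in degrees Celsius
humidity_pct = round(hum_norm * 100, 1)  # relative humidity in percent
windspeed = round(wind_norm * 67, 2)     # wind speed in original units

print(temp_c, atemp_c, humidity_pct, windspeed)  # 13.94 18.0 81.0 10.72
```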
df_all.info()
df_all.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     731 non-null    int64
 1   dteday      731 non-null    object
 2   season      731 non-null    int64
 3   yr          731 non-null    int64
 4   mnth        731 non-null    int64
 5   holiday     731 non-null    int64
 6   weekday     731 non-null    int64
 7   workingday  731 non-null    int64
 8   weathersit  731 non-null    int64
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64
 14  registered  731 non-null    int64
 15  cnt         731 non-null    int64
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB
|   | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 | 731.00 |
| mean | 366.00 | 2.50 | 0.50 | 6.52 | 0.03 | 3.00 | 0.68 | 1.40 | 0.50 | 0.47 | 0.63 | 0.19 | 848.18 | 3656.17 | 4504.35 |
| std | 211.17 | 1.11 | 0.50 | 3.45 | 0.17 | 2.00 | 0.47 | 0.54 | 0.18 | 0.16 | 0.14 | 0.08 | 686.62 | 1560.26 | 1937.21 |
| min | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.06 | 0.08 | 0.00 | 0.02 | 2.00 | 20.00 | 22.00 |
| 25% | 183.50 | 2.00 | 0.00 | 4.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.34 | 0.34 | 0.52 | 0.13 | 315.50 | 2497.00 | 3152.00 |
| 50% | 366.00 | 3.00 | 1.00 | 7.00 | 0.00 | 3.00 | 1.00 | 1.00 | 0.50 | 0.49 | 0.63 | 0.18 | 713.00 | 3662.00 | 4548.00 |
| 75% | 548.50 | 3.00 | 1.00 | 10.00 | 0.00 | 5.00 | 1.00 | 2.00 | 0.66 | 0.61 | 0.73 | 0.23 | 1096.00 | 4776.50 | 5956.00 |
| max | 731.00 | 4.00 | 1.00 | 12.00 | 1.00 | 6.00 | 1.00 | 3.00 | 0.86 | 0.84 | 0.97 | 0.51 | 3410.00 | 6946.00 | 8714.00 |
# Report about total dataset, with target feature
eda_report = sv.analyze([df_all,'Bike Rentals'], 'cnt')
eda_report.show_html('output/analyze_dataset.html')
eda_report.show_notebook(layout='widescreen', w=1500, h=700, scale=0.7)
Report output/analyze_dataset.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Comments / reasoning:
# Report comparison about split datasets, with target feature
eda_report_comparison = sv.compare([df, 'training data'], [df_last30, 'last 30 days'], 'cnt')
eda_report_comparison.show_html('output/analyze_dataset_comparison.html')
eda_report_comparison.show_notebook(layout='widescreen', w=1500, h=700, scale=0.7)
Report output/analyze_dataset_comparison.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Comments / reasoning:
# rename columns
df_all.rename(columns={'instant':'id','dteday':'datetime','yr':'year','mnth':'month','weathersit':'weather_condition',
'temp':'temperature', 'atemp':'feel_temperature', 'hum':'humidity','cnt':'total_count'},inplace=True)
df_all.head()
df_all.dtypes
|   | id | datetime | season | year | month | holiday | weekday | workingday | weather_condition | temperature | feel_temperature | humidity | windspeed | casual | registered | total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.34 | 0.36 | 0.81 | 0.16 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.36 | 0.35 | 0.70 | 0.25 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.20 | 0.19 | 0.44 | 0.25 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.20 | 0.21 | 0.59 | 0.16 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.23 | 0.23 | 0.44 | 0.19 | 82 | 1518 | 1600 |
id                     int64
datetime              object
season                 int64
year                   int64
month                  int64
holiday                int64
weekday                int64
workingday             int64
weather_condition      int64
temperature          float64
feel_temperature     float64
humidity             float64
windspeed            float64
casual                 int64
registered             int64
total_count            int64
dtype: object
df_all['datetime']=pd.to_datetime(df_all.datetime)
df_all['season']=df_all.season.astype('category')
df_all['year']=df_all.year.astype('category')
df_all['month']=df_all.month.astype('category')
df_all['holiday']=df_all.holiday.astype('category')
df_all['weekday']=df_all.weekday.astype('category')
df_all['workingday']=df_all.workingday.astype('category')
df_all['weather_condition']=df_all.weather_condition.astype('category')
df_all.dtypes
id                            int64
datetime             datetime64[ns]
season                     category
year                       category
month                      category
holiday                    category
weekday                    category
workingday                 category
weather_condition          category
temperature                 float64
feel_temperature            float64
humidity                    float64
windspeed                   float64
casual                        int64
registered                    int64
total_count                   int64
dtype: object
print('df_all shape : ' + str(df_all.shape))
# Split dataframe by numerical and categorical columns
num_df = df_all.select_dtypes(include = ['int64', 'float64'])
cat_df = df_all.select_dtypes(include = ['object', 'category', 'bool'])
# Get list of columns with missing values
missing_num = num_df.isnull().sum()
columns_with_missing_num = missing_num[missing_num > 0]
print("**These are the NUMERIC columns with missing values:**\n{} \n"\
.format(columns_with_missing_num))
# Get list of columns with missing values
missing_cat = cat_df.isnull().sum()
columns_with_missing_cat = (missing_cat[(missing_cat > 0) & (missing_cat < len(df_all))])
print("**These are the CATEGORICAL columns with missing values:**\n{} \n"\
.format(columns_with_missing_cat))
columns_with_all_missing_num = missing_num[missing_num == len(df_all)]
columns_with_all_missing_num = list(columns_with_all_missing_num.index)
print("**These are the NUMERICAL columns with ALL missing values:**\n{} \n"\
.format(columns_with_all_missing_num))
columns_with_all_missing_cat = missing_cat[missing_cat == len(df_all)]
columns_with_all_missing_cat = list(columns_with_all_missing_cat.index)
print("**These are the CATEGORICAL columns with ALL missing values:**\n{}"\
.format(columns_with_all_missing_cat))
df_all.drop(columns_with_all_missing_num, axis = 1, inplace = True)
df_all.drop(columns_with_all_missing_cat, axis = 1, inplace = True)
df_all shape : (731, 16)
**These are the NUMERIC columns with missing values:**
Series([], dtype: int64)

**These are the CATEGORICAL columns with missing values:**
Series([], dtype: float64)

**These are the NUMERICAL columns with ALL missing values:**
[]

**These are the CATEGORICAL columns with ALL missing values:**
[]
# Considering cardinality_threshold
cardinality_threshold = 10
# Get list of columns with their cardinality - don't want to consider numeric columns
categorical_columns = list(df_all.select_dtypes(exclude=[np.number]).columns)
cardinality = df_all[categorical_columns].apply(pd.Series.nunique)
columns_too_high_cardinality = list(cardinality[cardinality > cardinality_threshold].index)
print("There are {} columns with high cardinality. Threshold: {} categories."\
.format(len(columns_too_high_cardinality), cardinality_threshold))
columns_too_high_cardinality
There are 2 columns with high cardinality. Threshold: 10 categories.
['datetime', 'month']
ID_variables = ['id']
df_all.drop(ID_variables, axis = 1, inplace = True)
# casual & registered are components of the target (cnt) and not available at prediction time – drop them to avoid leakage
no_value_variables = ['casual', 'registered']
df_all.drop(no_value_variables, axis = 1, inplace = True)
# get list of columns with constant value
columns_constant = list(df_all.columns[df_all.nunique() <= 1])
print("There are {} columns with constant values".format(len(columns_constant)))
columns_constant
df_all.drop(columns_constant, axis = 1, inplace = True)
There are 0 columns with constant values
[]
corr_matrix = df_all.select_dtypes(exclude=['object']).corr().abs()  # np.object is removed in modern NumPy; use the string alias
# select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# find features with correlation > 0.95
columns_perfect_correlation = [column for column in upper.columns if any(upper[column] > 0.95)]
print("There are {} columns highly correlated (>0.95) with other columns: {} "\
    .format(len(columns_perfect_correlation), columns_perfect_correlation))
columns_perfect_correlation
There are 1 columns highly correlated (>0.95) with other columns: ['feel_temperature']
['feel_temperature']
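The upper-triangle pruning trick above can be illustrated on a synthetic frame where one column is a near-duplicate of another (the column names `a`, `b`, `c` are made up for this sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df_demo = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                  # independent column
})

corr = df_demo.corr().abs()
# keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b'] – the near-duplicate is flagged for dropping
```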
# I prefer to drop 'temperature' and keep 'feel_temperature' (atemp), which is the more meaningful predictor from a rider's perspective
df_all.drop('temperature', axis=1, inplace=True)
fig,ax = plt.subplots(figsize = (5, 3) )
# boxplot for total_count outliers
sns.boxplot(data = df_all[['total_count']])
ax.set_title('total_count outliers')
plt.show()
# plot box plot of categorical variables
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'total_count', data = df_all)
plt.subplot(3,3,2)
sns.boxplot(x = 'year', y = 'total_count', data = df_all)
plt.subplot(3,3,3)
sns.boxplot(x = 'month', y = 'total_count', data = df_all)
plt.subplot(3,3,4)
sns.boxplot(x = 'holiday', y = 'total_count', data = df_all)
plt.subplot(3,3,5)
sns.boxplot(x = 'weekday', y = 'total_count', data = df_all)
plt.subplot(3,3,6)
sns.boxplot(x = 'workingday', y = 'total_count', data = df_all)
plt.subplot(3,3,7)
sns.boxplot(x = 'weather_condition', y = 'total_count', data = df_all)
plt.show()
# plot box plot of continuous variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
plt.boxplot(df_all["feel_temperature"])
plt.subplot(2,3,3)
plt.boxplot(df_all["humidity"])
plt.subplot(2,3,4)
plt.boxplot(df_all["windspeed"])
plt.show()
fig,ax = plt.subplots(figsize = (7, 5))
# zoom in for outliers regarding windspeed & humidity features
sns.boxplot(data = df_all[['windspeed','humidity']])
ax.set_title('Windspeed_Humidity outliers')
plt.show()
# create a working copy of the columns that contain outliers
outliers = df_all[['windspeed', 'humidity']].copy()
# replace outliers by n/a
columns = ['windspeed','humidity']
for col in columns:
    q75, q25 = np.percentile(outliers.loc[:, col], [75, 25]) # upper and lower quartiles
    iqr = q75 - q25                # inter-quartile range
    lower = q25 - (iqr * 1.5)      # whisker bounds
    upper = q75 + (iqr * 1.5)
    # flag values outside the whiskers as missing
    outliers.loc[outliers[col] < lower, col] = np.nan
    outliers.loc[outliers[col] > upper, col] = np.nan
# impute outliers by using the average
outliers['windspeed'] = outliers['windspeed'].fillna(outliers['windspeed'].mean())
outliers['humidity'] = outliers['humidity'].fillna(outliers['humidity'].mean())
# write the imputed columns back to df_all
df_all['windspeed'] = outliers['windspeed']
df_all['humidity'] = outliers['humidity']
df_all.head(5)
|   | datetime | season | year | month | holiday | weekday | workingday | weather_condition | feel_temperature | humidity | windspeed | total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.36 | 0.81 | 0.16 | 985 |
| 1 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.35 | 0.70 | 0.25 | 801 |
| 2 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.19 | 0.44 | 0.25 | 1349 |
| 3 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.21 | 0.59 | 0.16 | 1562 |
| 4 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.23 | 0.44 | 0.19 | 1600 |
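The outlier loop above can be wrapped in a small reusable helper. A sketch on synthetic data (the function name is my own), using the same 1.5×IQR rule and mean imputation as the notebook:

```python
import numpy as np
import pandas as pd

def impute_iqr_outliers(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the mean
    of the remaining in-range values."""
    q25, q75 = np.percentile(series, [25, 75])
    iqr = q75 - q25
    lower, upper = q25 - k * iqr, q75 + k * iqr
    masked = series.where((series >= lower) & (series <= upper))  # outliers -> NaN
    return masked.fillna(masked.mean())

s = pd.Series([0.1, 0.2, 0.15, 0.18, 0.22, 5.0])  # 5.0 is a clear outlier
cleaned = impute_iqr_outliers(s)
print(cleaned.tolist())  # the 5.0 is replaced by the in-range mean (≈0.17)
```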
fig = plt.figure(figsize=(15,8))
stats.probplot(df_all.total_count.tolist(), dist='norm',plot=plt)
plt.show()
[Q-Q plot of total_count against a normal distribution; probplot fit: slope 1925.13, intercept 4504.35, r = 0.9908]
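The Q-Q plot checks normality visually; a numeric complement is a skewness check. A sketch on a synthetic, roughly total_count-like sample (mean and spread loosely based on the describe() output above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=4500, scale=1900, size=700)  # synthetic daily counts

skewness = stats.skew(sample)
# |skew| well below ~0.5 is usually read as approximately symmetric
print(abs(skewness) < 0.5)  # True for this normal sample
```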
# using Pearson Correlation
plt.figure(figsize=(12,10))
cor = df_all.corr()  # use df_all directly; avoid overwriting the earlier training split `df`
sns.heatmap(cor, annot=True, cmap=plt.cm.Blues)
plt.show()
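Beyond eyeballing the heatmap, correlations with the target can also be ranked numerically to shortlist predictive features. A minimal sketch on toy data (the column names and threshold here are hypothetical, not the notebook's):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"temp": rng.rand(50), "noise": rng.rand(50)})
df["count"] = 3 * df["temp"] + rng.normal(scale=0.1, size=50)

cor = df.corr()                               # Pearson by default
cor_target = cor["count"].abs().drop("count") # |correlation| with the target
shortlist = cor_target[cor_target > 0.5].index.tolist()
print(shortlist)
```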
Comments / reasoning:
season_type = pd.get_dummies(df_all['season'], drop_first = True)
season_type.rename(columns={2:"season_summer", 3:"season_fall", 4:"season_winter"},inplace=True)
season_type.head()
weather_type = pd.get_dummies(df_all['weather_condition'], drop_first = True)
weather_type.rename(columns={2:"weather_mist_cloud", 3:"weather_light_snow_rain"},inplace=True)
weather_type.head()
| | season_summer | season_fall | season_winter |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| | weather_mist_cloud | weather_light_snow_rain |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
# concatenate new dummy variables to df_all
df_all = pd.concat([df_all, season_type, weather_type], axis = 1)
# drop the original season & weather_condition columns
df_all.drop(columns=["season", "weather_condition"], inplace=True)
df_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   datetime                 731 non-null    datetime64[ns]
 1   year                     731 non-null    category
 2   month                    731 non-null    category
 3   holiday                  731 non-null    category
 4   weekday                  731 non-null    category
 5   workingday               731 non-null    category
 6   feel_temperature         731 non-null    float64
 7   humidity                 731 non-null    float64
 8   windspeed                731 non-null    float64
 9   total_count              731 non-null    int64
 10  season_summer            731 non-null    uint8
 11  season_fall              731 non-null    uint8
 12  season_winter            731 non-null    uint8
 13  weather_mist_cloud       731 non-null    uint8
 14  weather_light_snow_rain  731 non-null    uint8
dtypes: category(5), datetime64[ns](1), float64(3), int64(1), uint8(5)
memory usage: 36.9 KB
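The dummy-encoding step above relies on `drop_first=True` treating the first category level as the implicit baseline. A tiny self-contained illustration (toy series, not the notebook's data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 1], name="season")
# level 1 becomes the implicit baseline; only levels 2 and 3 get columns
d = pd.get_dummies(s, prefix="season", drop_first=True)
print(list(d.columns))  # ['season_2', 'season_3']
```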
Tasks: calculate the maximum number of bicycles nmax that was needed in any one day, and the 95%-percentile n95 that was needed in any one day (nmax bicycles would cover 100% of days, n95 covers 95%, etc.).
# calculate bike rentals per day
total_rents_by_day = df_all[['datetime', 'total_count']]
#total_rents_by_day
# visualize data (plotly sizes the figure itself, so no plt.figure is needed)
fig = px.line(total_rents_by_day, x = 'datetime', y = 'total_count', title = 'Total Rentals per Day')
fig.show()
# max number of bikes = total requested rides / max number of rides per bike
# assuming each bike serves at most 12 rides per day: total_count / 12
df_all["total_count_max12"] = df_all["total_count"]/12
#df_all.head()
# calculate the maximum number of bicycles nmax that was needed in any one day
nmax = df_all["total_count_max12"]
print("The maximum number of bicycles nmax that was needed in any one day is", round(nmax.quantile(1, 'nearest'), 1), "!")
# calculate the 95%-percentile of bicycles n95 that was needed in any one day
n95 = df_all["total_count_max12"]
print("The 95%-percentile of bicycles n95 that was needed in any one day is", round(n95.quantile(0.95, 'nearest'), 1), "!")
The maximum number of bicycles nmax that was needed in any one day is 726.2 !
The 95%-percentile of bicycles n95 that was needed in any one day is 631.7 !
a = list(range(1,101))
b = [scoreatpercentile(df_all["total_count"],i) for i in a]
df2 = pd.DataFrame({'percentile': a, 'total_count': b}, columns=['percentile', 'total_count'])
fig = px.line(df2, x = 'percentile', y = 'total_count', title = 'Distribution of the Covered Days Depending on the Number of Available Bicycles')
fig
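The coverage curve above is built from percentiles of daily demand: a fleet of n bikes covers every day whose demand is at most n. A sketch of the idea on toy numbers (hypothetical demand values, not the notebook's):

```python
import numpy as np

daily_demand = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
pmax = np.percentile(daily_demand, 100)  # a fleet of this size covers 100% of days
p95 = np.percentile(daily_demand, 95)    # covers ~95% of days
coverage = (daily_demand <= p95).mean()  # fraction of days actually covered
print(pmax, p95, coverage)
```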
Tasks:
Bike sharing demand prediction refers to forecasting the number of bicycles that will be rented within a specific time period, which aids resource allocation and system optimization. Since the target variable is a quantity over time, predicting daily demand is a regression problem. To assess the performance of the forecasting model, I'll use several metrics:
MAE and RMSE measure the average magnitude of the errors between the predicted and actual values.
R-squared measures the proportion of variance in the target variable that is explained by the input variables.
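These metrics can be computed directly with scikit-learn; a small sketch on made-up predictions:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

mae = metrics.mean_absolute_error(y_true, y_pred)           # mean |error| = 20.0
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))  # sqrt(500) ~ 22.36
r2 = metrics.r2_score(y_true, y_pred)                       # 1 - 2000/50000 = 0.96
print(mae, rmse, r2)
```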
The next step is to train a regression model (in this case a Random Forest), which will use the potentially predictive features we have identified to forecast the “total_count” label.
# define training dataset
df = df_all.iloc[:-30, :]
df.columns
df.dtypes
df.head(2)
Index(['datetime', 'year', 'month', 'holiday', 'weekday', 'workingday',
'feel_temperature', 'humidity', 'windspeed', 'total_count',
'season_summer', 'season_fall', 'season_winter', 'weather_mist_cloud',
'weather_light_snow_rain', 'total_count_max12'],
dtype='object')
datetime                   datetime64[ns]
year                             category
month                            category
holiday                          category
weekday                          category
workingday                       category
feel_temperature                  float64
humidity                          float64
windspeed                         float64
total_count                         int64
season_summer                       uint8
season_fall                         uint8
season_winter                       uint8
weather_mist_cloud                  uint8
weather_light_snow_rain             uint8
dtype: object
| | datetime | year | month | holiday | weekday | workingday | feel_temperature | humidity | windspeed | total_count | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | total_count_max12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 0 | 1 | 0 | 6 | 0 | 0.36 | 0.81 | 0.16 | 985 | 0 | 0 | 0 | 1 | 0 | 82.08 |
| 1 | 2011-01-02 | 0 | 1 | 0 | 0 | 0 | 0.35 | 0.70 | 0.25 | 801 | 0 | 0 | 0 | 1 | 0 | 66.75 |
# drop columns not needed for training
training_data = df.drop(['datetime', 'total_count_max12'], axis=1)
# move total_count as last column
training_data = training_data[ [ col for col in training_data.columns if col != 'total_count' ] + ['total_count']]
training_data.head()
| | year | month | holiday | weekday | workingday | feel_temperature | humidity | windspeed | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 6 | 0 | 0.36 | 0.81 | 0.16 | 0 | 0 | 0 | 1 | 0 | 985 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0.35 | 0.70 | 0.25 | 0 | 0 | 0 | 1 | 0 | 801 |
| 2 | 0 | 1 | 0 | 1 | 1 | 0.19 | 0.44 | 0.25 | 0 | 0 | 0 | 0 | 0 | 1349 |
| 3 | 0 | 1 | 0 | 2 | 1 | 0.21 | 0.59 | 0.16 | 0 | 0 | 0 | 0 | 0 | 1562 |
| 4 | 0 | 1 | 0 | 3 | 1 | 0.23 | 0.44 | 0.19 | 0 | 0 | 0 | 0 | 0 | 1600 |
# split the dataset into the train and test data
X_train, X_test, y_train, y_test = train_test_split(training_data.iloc[:,0:-1], training_data.iloc[:,-1], test_size = 0.2, random_state = 0)
print('x train :', X_train.shape,'\t\tx test :', X_test.shape)
print('y train :', y_train.shape,'\t\ty test :', y_test.shape)
x train : (560, 13) 		x test : (141, 13)
y train : (560,) 		y test : (141,)
# create a new dataset for train attributes
train_attributes = X_train[X_train.columns]
# create a new dataset for test attributes
test_attributes = X_test[X_test.columns]
# split dataframe by numerical and categorical columns
num_cols = X_train.select_dtypes(include = ['uint8', 'int64', 'float64']).columns
cat_cols = X_train.select_dtypes(include = ['object', 'bool', 'category']).columns
print("There are {} numeric columns and {} categorical columns".format(len(num_cols), len(cat_cols)))
There are 8 numeric columns and 5 categorical columns
# get dummy variables to encode the categorical features to numeric
train_encoded_attributes = pd.get_dummies(train_attributes, columns = cat_cols)
print('Shape of transfomed dataframe:', train_encoded_attributes.shape)
train_encoded_attributes.head(2)
Shape of transfomed dataframe: (560, 33)
| | feel_temperature | humidity | windspeed | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | year_0 | year_1 | month_1 | month_2 | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | holiday_0 | holiday_1 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 | workingday_0 | workingday_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 572 | 0.74 | 0.60 | 0.28 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 45 | 0.25 | 0.31 | 0.29 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
# training dataset for modelling
X_train = train_encoded_attributes
# train the model
model = RandomForestRegressor(random_state = 0, n_estimators = 200)
# fit the trained model
model.fit(X_train, y_train)
RandomForestRegressor(n_estimators=200, random_state=0)
Cross-validation is used to estimate the performance of machine learning models; more specifically, it protects against overfitting in a predictive model, particularly when the amount of data is limited.
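As a self-contained sketch of the idea on synthetic data (the data and model settings here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=100)

# 5-fold CV: each fold is held out once while the model trains on the rest
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, cv=5)  # default scoring for a regressor is R^2
print(scores.mean())
```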
predict = cross_val_predict(model, X_train, y_train, cv=3)
# cross validation prediction plot
fig,ax = plt.subplots(figsize=(15,8))
ax.scatter(y_train, y_train-predict)
ax.axhline(lw=2,color='black')
ax.set_title('Cross validation prediction plot')
ax.set_xlabel('Observed')
ax.set_ylabel('Residual')
#plt.show()
# calculate equation for trendline
z = np.polyfit(y_train, y_train-predict, 1)
p = np.poly1d(z)
# add trendline to plot
plt.plot(y_train, p(y_train), color="lightgreen", linewidth=3, linestyle="--")
# R-squared scores
r2_scores = cross_val_score(model, X_train, y_train, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8567570676393114
Answers / comments / reasoning:
# get dummy variables to encode the categorical features to numeric
test_encoded_attributes=pd.get_dummies(test_attributes,columns=cat_cols)
print('Shape of transformed dataframe :', test_encoded_attributes.shape)
test_encoded_attributes.head(2)
Shape of transformed dataframe : (141, 33)
| | feel_temperature | humidity | windspeed | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | year_0 | year_1 | month_1 | month_2 | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | holiday_0 | holiday_1 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 | workingday_0 | workingday_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 456 | 0.42 | 0.68 | 0.17 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 675 | 0.28 | 0.57 | 0.17 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
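Encoding the train and test splits separately only works here because every category level happens to appear in both splits. A more defensive sketch (toy frames with hypothetical values) aligns the test columns to the training layout:

```python
import pandas as pd

train = pd.DataFrame({"weekday": ["mon", "tue", "wed"]})
test = pd.DataFrame({"weekday": ["mon", "mon"]})  # 'tue'/'wed' absent from test

train_enc = pd.get_dummies(train, columns=["weekday"])
test_enc = pd.get_dummies(test, columns=["weekday"])

# add dummy columns missing from the test split as 0, drop unseen extras
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```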
# predict on the test set
X_test = test_encoded_attributes
y_pred = model.predict(X_test)
# R-squared scores
r2_scores = cross_val_score(model, X_test, y_test, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8025713291399266
# find the best value for n_estimators on the held-out test set
best_r2 = 0
index = -1
for i in range(10, 200):
    model = RandomForestRegressor(random_state = 0, n_estimators = i)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2_score = metrics.r2_score(y_test, y_pred)
    if r2_score > best_r2:
        index = i
        best_r2 = r2_score
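A grid search with cross-validation would be the more standard way to pick `n_estimators`, since it avoids tuning directly against the held-out test set. A hedged sketch on synthetic data (toy values throughout):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = X.sum(axis=1) + rng.normal(scale=0.1, size=80)

# 3-fold CV over a small candidate grid, scored by R^2
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"n_estimators": [10, 50, 100]},
                    cv=3, scoring="r2")
grid.fit(X, y)
print(grid.best_params_)
```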
# confirm same dimension for the target variable
y_test.shape
y_pred.shape
(141,)
(141,)
model_opt = RandomForestRegressor(random_state = 0, n_estimators = index)
model_opt.fit(X_train, y_train)
y_pred = model_opt.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
me = metrics.max_error(y_test, y_pred)
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Max Error:', me)
RandomForestRegressor(n_estimators=195, random_state=0)
Mean Absolute Error: 567.2639388979815
Mean Squared Error: 614874.5165653429
Max Error: 2562.769230769231
# R-squared scores
r2_scores = cross_val_score(model_opt, X_test, y_test, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8026568491315675
Answers / comments / reasoning:
# save the optimized trained model as a pickle file
#with open('model/model_opt_18092023.pkl', 'wb') as f:
#    pickle.dump(model_opt, f)
# load the pickled model
#with open('model/model_opt_18092023.pkl', 'rb') as f:
#    rf_model_pkl = pickle.load(f)
# use the loaded pickled model to make predictions
#rf_model_pkl.predict(X_test)
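For reference, serializing a fitted model with `pickle.dump` and restoring it with `pickle.load` round-trips its learned parameters; a self-contained sketch with a toy model and an in-memory buffer standing in for a file:

```python
import io
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 2.0, 4.0])
model = LinearRegression().fit(X, y)

buf = io.BytesIO()
pickle.dump(model, buf)      # serialize the fitted model
buf.seek(0)
restored = pickle.load(buf)  # deserialize and reuse
print(restored.predict([[3.0]])[0])  # ~6.0 on this exactly linear toy data
```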
# residual scatter plot
fig, ax = plt.subplots(figsize=(15,8))
residuals=y_test-y_pred
ax.scatter(y_test, residuals)
ax.axhline(lw=2, color='black')
ax.set_xlabel('Observed')
ax.set_ylabel('Residuals')
ax.set_title('Residual plot')
#plt.show()
# calculate equation for trendline
z = np.polyfit(y_test, residuals, 1)
p = np.poly1d(z)
# add trendline to plot
plt.plot(y_test, p(y_test), color="lightgreen", linewidth=3, linestyle="--")
Out-of-sample testing evaluates the performance of a model on a separate set of data that was not used during development and optimization.
This helps determine whether the model would perform well on new, unseen data.
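The holdout below keeps the last 30 days aside before any training; the same split can be sketched on a synthetic 731-day frame (toy counts, real calendar span):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.date_range("2011-01-01", periods=731, freq="D"),
    "total_count": np.arange(731),  # placeholder values
})
train, oos = df.iloc[:-30], df.tail(30)  # last 30 days held out for OOS testing
print(train.shape, oos.shape)
```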
# define out-of-sample dataset
df_last30 = df_all.tail(30)
df_last30.head()
| | datetime | year | month | holiday | weekday | workingday | feel_temperature | humidity | windspeed | total_count | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | total_count_max12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 701 | 2012-12-02 | 1 | 12 | 0 | 0 | 0 | 0.36 | 0.82 | 0.12 | 4649 | 0 | 0 | 1 | 1 | 0 | 387.42 |
| 702 | 2012-12-03 | 1 | 12 | 0 | 1 | 1 | 0.46 | 0.77 | 0.08 | 6234 | 0 | 0 | 1 | 0 | 0 | 519.50 |
| 703 | 2012-12-04 | 1 | 12 | 0 | 2 | 1 | 0.47 | 0.73 | 0.17 | 6606 | 0 | 0 | 1 | 0 | 0 | 550.50 |
| 704 | 2012-12-05 | 1 | 12 | 0 | 3 | 1 | 0.43 | 0.48 | 0.32 | 5729 | 0 | 0 | 1 | 0 | 0 | 477.42 |
| 705 | 2012-12-06 | 1 | 12 | 0 | 4 | 1 | 0.26 | 0.51 | 0.17 | 5375 | 0 | 0 | 1 | 0 | 0 | 447.92 |
# save date variable
times = df_last30['datetime']
# drop columns not needed for prediction
testing_data = df_last30.drop(['datetime', 'total_count_max12'], axis=1)
# move total_count as last column
testing_data = testing_data[ [ col for col in testing_data.columns if col != 'total_count' ] + ['total_count']]
testing_data.head()
print('Shape of OOS dataframe :', testing_data.shape)
| | year | month | holiday | weekday | workingday | feel_temperature | humidity | windspeed | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 701 | 1 | 12 | 0 | 0 | 0 | 0.36 | 0.82 | 0.12 | 0 | 0 | 1 | 1 | 0 | 4649 |
| 702 | 1 | 12 | 0 | 1 | 1 | 0.46 | 0.77 | 0.08 | 0 | 0 | 1 | 0 | 0 | 6234 |
| 703 | 1 | 12 | 0 | 2 | 1 | 0.47 | 0.73 | 0.17 | 0 | 0 | 1 | 0 | 0 | 6606 |
| 704 | 1 | 12 | 0 | 3 | 1 | 0.43 | 0.48 | 0.32 | 0 | 0 | 1 | 0 | 0 | 5729 |
| 705 | 1 | 12 | 0 | 4 | 1 | 0.26 | 0.51 | 0.17 | 0 | 0 | 1 | 0 | 0 | 5375 |
Shape of OOS dataframe : (30, 14)
# create a new dataset for test attributes
testing_data_attributes = testing_data[testing_data.columns]
# split dataframe by numerical and categorical columns
num_cols = testing_data.select_dtypes(include = ['uint8', 'int64', 'float64']).columns
cat_cols = testing_data.select_dtypes(include = ['object', 'bool', 'category']).columns
print("There are {} numeric columns and {} categorical columns".format(len(num_cols), len(cat_cols)))
# get dummy variables to encode the categorical features to numeric
testing_data_encoded_attributes = pd.get_dummies(testing_data_attributes, columns=cat_cols)
# drop target variable
testing_data_encoded_attributes = testing_data_encoded_attributes.drop(['total_count'], axis = 1)
print('Shape of transformed dataframe :', testing_data_encoded_attributes.shape)
testing_data_encoded_attributes.head(2)
There are 9 numeric columns and 5 categorical columns
Shape of transformed dataframe : (30, 33)
| | feel_temperature | humidity | windspeed | season_summer | season_fall | season_winter | weather_mist_cloud | weather_light_snow_rain | year_0 | year_1 | month_1 | month_2 | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | holiday_0 | holiday_1 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 | workingday_0 | workingday_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 701 | 0.36 | 0.82 | 0.12 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 702 | 0.46 | 0.77 | 0.08 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# make predictions
y_pred_testing = model_opt.predict(testing_data_encoded_attributes)
# submit final sample
Submission = pd.DataFrame({'datetime' : times, 'pred' : y_pred_testing})
Submission.set_index('datetime', inplace = True)
Submission.to_csv('output/sample_submission.csv')
Submission
| datetime | pred |
|---|---|
| 2012-12-02 | 4151.09 |
| 2012-12-03 | 6463.96 |
| 2012-12-04 | 6707.82 |
| 2012-12-05 | 6322.25 |
| 2012-12-06 | 3339.76 |
| 2012-12-07 | 4379.96 |
| 2012-12-08 | 4517.68 |
| 2012-12-09 | 4122.96 |
| 2012-12-10 | 3984.20 |
| 2012-12-11 | 4633.91 |
| 2012-12-12 | 4518.00 |
| 2012-12-13 | 4418.54 |
| 2012-12-14 | 4437.31 |
| 2012-12-15 | 4741.36 |
| 2012-12-16 | 4373.70 |
| 2012-12-17 | 4273.49 |
| 2012-12-18 | 4951.97 |
| 2012-12-19 | 4913.23 |
| 2012-12-20 | 4577.46 |
| 2012-12-21 | 3727.81 |
| 2012-12-22 | 2820.97 |
| 2012-12-23 | 2944.27 |
| 2012-12-24 | 2836.15 |
| 2012-12-25 | 3348.11 |
| 2012-12-26 | 2815.71 |
| 2012-12-27 | 3002.12 |
| 2012-12-28 | 3216.13 |
| 2012-12-29 | 2669.78 |
| 2012-12-30 | 2853.44 |
| 2012-12-31 | 2881.16 |
Tasks: (Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.
In summary, this notebook conducted a comprehensive analysis of two years of daily bike rental data. It covered data exploration, preprocessing, and feature engineering to prepare the data for modeling. The exploratory data analysis provided valuable insights into rental patterns driven by factors such as weather, day of the week, and seasonality. In conclusion, the notebook gave us useful insight into bike rental trends and successfully predicted rental counts with a Random Forest model.
Future improvements:
The analysis and insights presented here can guide bike-sharing companies in optimizing their services and meeting the diverse preferences of their user base.
Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.
[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.